Due: Sep 10 (Tuesday) 11:00 pm
Amit Shetty
In this assignment, you will get familiar with tools including Python, NumPy, Matplotlib, pandas, and Jupyter Notebook. Search for two datasets, one for classification and one for regression, from any data source. Each dataset should be large enough: more than 10,000 samples and more than 10 feature values.
The section on linear algebra covers the basics of the topic needed to understand machine learning and deep learning algorithms. Relevant aspects of linear algebra are discussed concisely to provide a one-stop shop for understanding the language that most, if not all, AI algorithms use.
The following portions of linear algebra are summarized below:
Scalars: A scalar is just a single number, which may take on different types of values, e.g. real-valued (slope of a line), natural number (number of units), etc.
Vectors: A vector is an array of numbers of the same type (e.g. x ∈ ℝⁿ), arranged in order and indexed as x_1, x_2, ..., x_n. They are expressed as: [x_1, x_2, ..., x_n]
Matrices: A matrix is a 2-D array of elements, each element being indexed by two numbers. A real-valued matrix A of height m and width n is represented as A ∈ ℝ^{m×n}. The element in the i-th row and j-th column is indexed as A_{i,j}. f(A)_{i,j} represents the element (i, j) of the matrix computed by applying the function f to A.
Tensors: An array of numbers arranged in a regular grid with a variable number of axes is known as a tensor. The element at coordinates (i, j, k) of a tensor A is represented as A_{i,j,k}.
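The four objects above map directly onto NumPy arrays; a minimal sketch (the particular values are arbitrary, chosen only to illustrate shapes):

```python
import numpy as np

# Scalar: a single number
s = 3.5

# Vector: a 1-D array, indexed x[0] ... x[n-1]
x = np.array([1.0, 2.0, 3.0])

# Matrix: a 2-D array; A[i, j] is the element in row i, column j
A = np.array([[1, 2],
              [3, 4]])

# Tensor: an array with more than two axes; element at coordinates (i, j, k)
T = np.zeros((2, 3, 4))

print(x.shape, A.shape, T.shape)  # (3,) (2, 2) (2, 3, 4)
```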
An important matrix operation is the transpose: the transpose of any matrix A (denoted A^T) is its mirror image across the main diagonal.
One of the basic operations used in almost all machine learning algorithms is matrix multiplication. For the product of two matrices A ∈ ℝ^{m×n} and B ∈ ℝ^{k×p} to exist, n and k must be equal. The resulting matrix C = AB has shape m × p.
Certain useful properties of matrices are listed below:
A(B+C) = AB + AC (Distributive Law)
A(BC) = (AB)C (Associative Law)
AB ≠ BA in general (matrix multiplication is not commutative)
(AB)^T = (B^T)(A^T)
x^T y = (x^T y)^T = y^T x (Transpose Rule)
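These properties can be checked numerically; a quick NumPy sketch using arbitrary random matrices (shapes chosen only so the products exist):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))

# Shapes: (2x3) @ (3x4) -> (2x4); the inner dimensions must match
assert (A @ B).shape == (2, 4)

# Distributive law: A(B + C) = AB + AC
assert np.allclose(A @ (B + C), A @ B + A @ C)

# Transpose rule: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)

# Matrix multiplication is generally NOT commutative
M = rng.standard_normal((3, 3))
N = rng.standard_normal((3, 3))
print(np.allclose(M @ N, N @ M))  # almost always False for random matrices
```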
The identity matrix I_n is a matrix that doesn't change a vector when the vector is multiplied by it. The entries along its main diagonal are 1; all other entries are zero. The inverse of A is denoted A^-1, and it is defined by the rule: A^-1 A = A A^-1 = I_n
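A small NumPy sketch of the inverse rule, using an arbitrary invertible 2x2 matrix (note the equalities only hold up to floating-point error):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)

# A^-1 A = A A^-1 = I (up to floating-point error)
I = np.eye(2)
print(np.allclose(A_inv @ A, I))  # True
print(np.allclose(A @ A_inv, I))  # True
```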
The span of a set of vectors is the set of all points that can be obtained as linear combinations of those vectors. If none of the vectors is a linear combination of the other vectors, then the set of vectors is said to be linearly independent.
Norms are used to define the size of a vector. Different types of norms are discussed below:
- Euclidean norm: the L^2 norm, which is heavily used in machine learning; it can also be calculated as sqrt(x^T x) (x^T x itself gives the squared L^2 norm).
- L^1 norm: used when the difference between zero and non-zero elements is very important.
- Max norm (L^∞ norm): the absolute value of the element with the largest magnitude in the vector.
- Frobenius norm: used to measure the size of a matrix (analogous to the L^2 norm for vectors).
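All of these norms are available through `np.linalg.norm` via its `ord` parameter; a short sketch on an arbitrary example vector:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # Euclidean (L2) norm: sqrt(9 + 16) = 5.0
l1 = np.linalg.norm(x, ord=1)         # L1 norm: |3| + |-4| = 7.0
linf = np.linalg.norm(x, ord=np.inf)  # max norm: largest magnitude = 4.0

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
fro = np.linalg.norm(A, ord='fro')    # Frobenius norm: sqrt(1 + 4 + 9 + 16)

print(l2, l1, linf)  # 5.0 7.0 4.0
```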
Now that the common matrix operations have been discussed, certain special types of matrices are discussed:
- Diagonal matrices: these have non-zero entries only along the main diagonal, e.g. the identity matrix I_n. A square diagonal matrix can be represented as diag(v), where the vector v holds the elements along the main diagonal. Multiplying by a diagonal matrix is computationally efficient: diag(v)x can be calculated by simply scaling each x_i by v_i. A diagonal matrix need not be square.
- Symmetric matrix: A = A^T.
- Unit vector: a vector that has unit norm, i.e. ||x||_2 = 1.
- Orthogonal vectors: two vectors x and y are orthogonal if x^T y = 0, which means that if both of them have non-zero norm, the vectors are at a 90-degree angle to each other. Orthogonal vectors having unit norm are called orthonormal vectors.
- Orthogonal matrix: a matrix whose rows are mutually orthonormal (and whose columns are too). Thus A^T A = A A^T = I, which implies A^-1 = A^T. For orthogonal matrices, the inverse is therefore easy and inexpensive to compute.
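Two of these special matrices can be illustrated quickly in NumPy; the vectors and the rotation angle below are arbitrary choices for the sketch (a 2-D rotation matrix is a standard example of an orthogonal matrix):

```python
import numpy as np

# Diagonal matrix from a vector v: diag(v)
v = np.array([1.0, 2.0, 3.0])
D = np.diag(v)
x = np.array([4.0, 5.0, 6.0])
# Multiplying by diag(v) just scales each x_i by v_i
print(np.allclose(D @ x, v * x))  # True

# Orthogonal matrix: rows (and columns) are orthonormal, so A^T = A^-1
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2-D rotation matrix
print(np.allclose(R.T @ R, np.eye(2)))  # True
```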
Probability Theory provides a mathematical framework for representing uncertainty. In AI applications, probability theory is used in two ways:
The law of probability specifies how an AI system should reason. For example, for those who are familiar with probability, a classification problem (e.g. given an image, classify whether the image is that of a "cat" or a "dog") can be viewed as finding P(Y|X), where X is the input data and Y is the label that we are predicting. We design our algorithms to compute or approximate (in case computing exact value is not feasible) various expressions derived using probability theory.
Use probability and statistics to analyze the behaviour of proposed AI systems. For example, we can analyze the accuracy of a classification model by observing how many of the predictions are correct.
Two kinds of probability are discussed:
Frequentist Probability: Probability theory was originally developed to analyze the frequencies of events (which are often repeatable, e.g. drawing a certain hand of cards in a poker game). When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment infinitely many times, then a proportion p of those repetitions would produce that outcome. This kind of probability, related directly to the rates at which events occur, is called frequentist probability.
Bayesian Probability: The above reasoning doesn't seem applicable to experiments that are not repeatable, e.g. when a doctor says that a patient has a 40% chance of having the flu, the probability represents a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient doesn't have the flu. This kind of probability, related to qualitative levels of reasoning, is called Bayesian probability.
Random Variable: A random variable is a variable, e.g. x, that can take on different values (states) randomly. Since it takes on values randomly, there must be a probability associated with each of those values. Thus, a random variable must be coupled with a probability distribution that specifies how likely each of the states is.
There are two types of random variables:
Discrete: The number of states is finite or countably infinite.
Continuous: The variable takes on real values.
Probability distribution: A probability distribution is a description of how likely a random variable, or a set of random variables, is to take on each of its possible states.
Probability Mass Function: The probability distribution over discrete random variables is described using a probability mass function (PMF). A probability mass function acting over multiple variables is called a joint probability distribution.
For continuous random variables, the probability distribution is described using a probability density function (PDF).
Sometimes we know the probability distribution over a set of variables and need to find the probability over just a subset of them. The probability over the subset is called the marginal probability distribution.
Conditional Probability: This is one of the most common types of probability. It is the probability of something happening given that something else has already happened, written P(Y|X). Conditional probability is computed from the joint and marginal distributions as P(Y|X) = P(X, Y) / P(X); Bayes' rule, used in Bayesian classification, builds on this formula.
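The marginal and conditional definitions above can be sketched with a tiny table of joint counts over two binary variables; the counts here are made up purely for illustration:

```python
import numpy as np

# Joint counts over two binary variables: rows index Y, columns index X
# (made-up numbers for illustration)
counts = np.array([[30, 10],   # Y = 0
                   [20, 40]])  # Y = 1

# Joint distribution P(X, Y): normalize the counts
joint = counts / counts.sum()

# Marginal distribution P(X): sum out Y (sum over rows)
p_x = joint.sum(axis=0)

# Conditional distribution P(Y | X) = P(X, Y) / P(X)
# (broadcasting divides each column j by p_x[j])
p_y_given_x = joint / p_x

print(p_y_given_x[:, 1])  # P(Y=0 | X=1), P(Y=1 | X=1) -> [0.2 0.8]
```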
The expectation, or expected value, of some function f(x) with respect to a probability distribution P(x) is the average, or mean value, that the function f takes on, when x is drawn from P
Variance gives a measure of how much the values of random variable x vary
Covariance measures how two variables are linearly related, but it is also affected by the scale of the variables. Correlation normalizes the scale away, measuring only how strongly the variables are related.
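The scale-dependence of covariance versus correlation is easy to demonstrate with synthetic data; the linear relationship below (y roughly 2x) is an arbitrary choice for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = 2.0 * x + 0.1 * rng.standard_normal(1000)  # strongly related to x

# Variance: how much the values of a single variable vary
var_x = np.var(x)

# Covariance depends on the scale of the variables...
cov = np.cov(x, y)[0, 1]

# ...while correlation normalizes scale away and stays in [-1, 1]
corr = np.corrcoef(x, y)[0, 1]

# Rescaling y changes the covariance but not the correlation
cov_scaled = np.cov(x, 10 * y)[0, 1]
corr_scaled = np.corrcoef(x, 10 * y)[0, 1]
print(np.isclose(corr, corr_scaled))  # True
print(np.isclose(cov, cov_scaled))    # False
```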
Certain probability distributions are discussed, such as the Bernoulli distribution (a distribution over a single binary random variable), the Multinoulli distribution (similar to the Bernoulli distribution, except that the discrete random variable can take k different states) and the Gaussian distribution.
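Samples from each of these distributions can be drawn with NumPy's random generator; the parameter values below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: a single binary variable with P(x = 1) = phi
phi = 0.3
bern = rng.binomial(n=1, p=phi, size=10000)

# Multinoulli: one of k = 3 discrete states with given probabilities
probs = [0.2, 0.5, 0.3]
multi = rng.choice(3, p=probs, size=10000)

# Gaussian: parameterized by mean mu and standard deviation sigma
gauss = rng.normal(loc=0.0, scale=1.0, size=10000)

# With many samples the empirical mean approaches phi
print(abs(bern.mean() - phi) < 0.02)  # True
```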
The goal of analysing this dataset is to see which area in the city of Bangalore, India serves the best dishes. This dataset is also used to predict the eating habits of Bangaloreans and to help potential business owners decide whether to expand or invest in their own restaurant.
https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants
Reading and Verifying Data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv('zomato.csv')
dataset.head()
The methods below show that the dataset contains null values, so data cleaning will be needed. Extra information about the dataset, such as the amount of data we are dealing with, is also calculated.
dataset.info()
dataset.shape
# Checking the amount of data to clean
dataset.isna().sum()
There are certain columns that will not contribute to our analysis; they will have to be dropped to avoid noise in our dataset.
dataset.drop(columns=['url', 'phone', 'dish_liked', 'address', 'reviews_list', 'menu_item'], inplace=True)
Now that the unnecessary columns have been dropped, we can proceed with removing rows that have any null values, as such incomplete data would skew our results and introduce unnecessary outliers.
dataset.dropna(how='any',inplace=True)
dataset.isna().sum()
To get an overview of the data, we will look at the best performing suburbs in the city in terms of the number of restaurants present
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(dataset.groupby(['location']).size().nlargest(5))
# Showing the 5 largest areas with market share
plt.figure(figsize=(12,12))
plt.pie(dataset.groupby(['location']).size().nlargest(5), labels=['BTM', 'HSR', 'Koramangala 5th Block', 'JP Nagar', 'Whitefield'])
plt.show()
# Showing the rest of the locations in Bangalore
plt.figure(figsize=(14,14))
locations=dataset['location'].value_counts()[:15]
sns.barplot(x=locations, y=locations.index)
plt.show()
Showing which type of dining service best describes what people are looking for in a restaurant.
# Plotting the number of restaurants for each type in the dataset
plt.figure(figsize=(12,12))
sns.countplot(x=dataset['listed_in(type)'])
plt.title('Restaurant food type')
plt.xlabel('Restaurant Type')
plt.ylabel('Number of restaurants')
plt.show()
Showing the number of restaurants that provide table booking and online ordering facilities, as these are in popular demand.
# Checking which restaurants provide online ordering facility
plt.figure(figsize=(12,12))
plt.title('Restaurants delivering online orders?')
sns.countplot(x=dataset['online_order'])
plt.show()
We can see that a large number of restaurants offer home delivery, suggesting that to succeed, a business must have that option.
# Checking which restaurants provide table booking facility
plt.figure(figsize=(12,12))
plt.title('Restaurants providing table booking?')
sns.countplot(x=dataset['book_table'])
plt.show()
This is an interesting observation, since most restaurants don't have the facility to book tables. There is scope for improvement here: businesses can use this data to add the ability to book tables, thus increasing their business.
No city in the world is complete without its own variety of cuisines. We will now look into the cuisines that are popular with Bangaloreans.
# Get the popular 20 cuisines of the city
plt.figure(figsize=(12,12))
plt.title('Most popular cuisines in the city')
cuisines=dataset['cuisines'].value_counts()[:20]
sns.barplot(x=cuisines, y=cuisines.index)
plt.xlabel('Count')
plt.ylabel('Cuisines')
plt.show()
Bangalore is a city in the south of India. It is surprising to see that North Indian food is the most popular.
Checking which type of dining service gives restaurants the highest ratings. We will focus on the table booking and online ordering facilities, as they were the most popular features in our previous observations.
# Plotting a multivariate bar graph to show how the ability to book tables affects the restaurant's rating
plt.figure(figsize=(14,14))
sns.countplot(x='rate',hue='book_table',data=dataset)
plt.show()
# Plotting a multivariate bar graph to show how the ability to order online affects the restaurant's rating
plt.figure(figsize=(14,14))
sns.countplot(x='rate',hue='online_order',data=dataset)
plt.show()
We can see from the above graphs that restaurants have the highest ratings when they offer both table booking and online ordering, likely because the dining styles preferred by Bangaloreans are quick bites and dine-outs.
Indians care a lot about where they spend their money, so a distribution plot is used to show how the prices for dining (for 2 people) are distributed.
dataset['approx_cost(for two people)']=dataset['approx_cost(for two people)'].apply(lambda x: int(x.replace(',','')))
plt.figure(figsize=(14,14))
sns.histplot(dataset['approx_cost(for two people)'], kde=True)
plt.title('Approx Cost distribution for 2 people in rupees')
plt.show()
As shown in the distribution plot, Bangaloreans don't prefer paying extremely high or low prices for food; the smooth distribution in the plot indicates that.
The goal of this dataset is to monitor 14 gas sensors in a controlled environment when they are exposed, in humid conditions, to a mixture of carbon monoxide and synthetic air. The sensors monitor the differences in CO concentration, humidity and temperature, and the reading for each of the 14 gas sensors is also recorded.
https://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+temperature+modulation
NOTE: We are limiting our dataset to 10,000 rows, but the amount can be increased to nearly 20 times that if needed. For the purposes of this assignment, and to run the visualisations quickly, we will be using the smaller dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the gas sensor dataset
dataset1 = pd.read_csv('20160930_203718.csv', nrows=10000)
dataset1.head()
Checking for any null values in the dataset, and the datatypes, to ensure we can perform our analysis carefully.
dataset1.info()
dataset1.shape
The first part of our observations will be to monitor four key aspects of our data.
PART 1: Carbon Monoxide Concentrations
plt.figure(figsize=(30,12))
sns.regplot(x="Time (s)", y="CO (ppm)", data=dataset1)
plt.show()
PART 2: Humidity Level
Humidity levels are interesting, since we can see how changes in the CO concentration and heater voltage affect the humidity level in the chamber.
plt.figure(figsize=(30,12))
sns.scatterplot(x="Time (s)", y="Humidity (%r.h.)", data=dataset1)
plt.show()
PART 3: Temperature
Since the temperature is tightly controlled, it being a key factor in the reactivity inside the gas chamber, we see a more or less stable temperature with little variance.
plt.figure(figsize=(30,12))
sns.scatterplot(x="Time (s)", y="Temperature (C)", data=dataset1)
plt.show()
PART 4: Heater Voltage
NOTE: Heater voltage is plotted using only the first 200 values, since the voltage constantly fluctuates and a smaller subset shows the variations better.
plt.figure(figsize=(30,12))
plt.plot(dataset1['Time (s)'][:200], dataset1['Heater voltage (V)'][:200])
plt.show()
Since we are dealing with readings from 14 different gas sensors, writing a visualisation for each would consume a lot of memory and would not give an accurate bird's-eye view. To get one, we use a pair plot, which shows not only the sensor readings of the 14 gas sensors but also the input variables that were used to produce them.
sns.pairplot(dataset1)
plt.show()
Doing this challenge has been an interesting ride. Exploring two completely different datasets has been an enjoyable experience. While one of the datasets was something I was very familiar with, since I have used their services, it opened my eyes to how business decisions are made using data.
Arriving at an analysis for the regression task proved more challenging than I had hoped. The area of the dataset was unfamiliar to me, but going through the documentation and understanding the data made the task a little easier. Monitoring input and output variations from each individual sensor using a single graph was a eureka moment for me.
| Points | Description |
|---|---|
| 10 | Introduction |
| 20 | Review |
| 10 | Linear algebra |
| 10 | Probability theory |
| 60 | Data |
| 5 | Introduction of data for regression & source |
| 5 | Reading the data |
| 5 | Preprocessing of the data |
| 10 | Visualization of the data |
| 5 | Preliminary observation |
| 5 | Introduction of data for classification & source |
| 5 | Reading the data |
| 5 | Preprocessing of the data |
| 10 | Visualization of the data |
| 5 | Preliminary observation |
| 5 | Conclusions |
| 5 | References |